Introduction

For my end-of-studies mini senior project, I found a dataset of Airbnb rental data covering more than 51,000 listings across European cities. The dataset includes features such as the total listing price, room type, host status, amenities, and location information, which can be used to analyze how these factors relate to Airbnb prices. For more information see: https://www.kaggle.com/datasets/thedevastator/airbnb-price-determinants-in-europe?resource=download

The data covers several European cities, with separate files for weekdays and weekends. Let’s begin by importing all of these files into one aggregate dataset:

# data frame for storing dataset
combined_data <- data.frame()

# list of cities & data types
cities <- c("amsterdam", "athens", "barcelona", "berlin", "budapest", "lisbon", "london", "paris", "rome", "vienna")
data_types <- c("weekdays", "weekends")

# import data from each file into combined_data
for (city in cities) {
  for (data_type in data_types) {
    # file_path for CSV file
    file_path <- paste("data/", city, "_", data_type, ".csv", sep = "")
    
    # import CSV file
    city_data <- read.csv(file_path)
    
    # Add variables to identify city and data_type (weekend or week-day)
    city_data$city <- city
    city_data$data_type <- data_type
    
    # Import into combined data
    combined_data <- rbind(combined_data, city_data)
  }
}
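Growing a data frame with rbind() inside a loop copies it on every iteration, which can get slow as the number of files increases. A minimal sketch of an alternative is to read every file into a list and bind the pieces once at the end. To keep the example self-contained it writes two tiny made-up CSV files to a temporary folder; the helper name read_city_file and the demo_ objects are hypothetical and not part of the original analysis.

```r
# Write two tiny CSV files (made-up values) so the sketch runs without the real data
demo_dir <- tempfile("airbnb_demo_")
dir.create(demo_dir)
write.csv(data.frame(realSum = c(100, 200)),
          file.path(demo_dir, "amsterdam_weekdays.csv"), row.names = FALSE)
write.csv(data.frame(realSum = 150),
          file.path(demo_dir, "amsterdam_weekends.csv"), row.names = FALSE)

demo_cities <- c("amsterdam")
demo_types <- c("weekdays", "weekends")

# One row per city / data_type combination, i.e. one row per file
file_grid <- expand.grid(city = demo_cities, data_type = demo_types,
                         stringsAsFactors = FALSE)

# Hypothetical helper: read one file and tag it with its city and data_type
read_city_file <- function(city, data_type, dir) {
  d <- read.csv(file.path(dir, paste0(city, "_", data_type, ".csv")))
  d$city <- city
  d$data_type <- data_type
  d
}

# Read every file into a list, then bind all pieces in a single call
pieces <- lapply(seq_len(nrow(file_grid)), function(i) {
  read_city_file(file_grid$city[i], file_grid$data_type[i], demo_dir)
})
demo_combined <- do.call(rbind, pieces)

print(demo_combined)
```

With only twenty files the loop above is perfectly adequate; the list-based version mainly matters for larger collections.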

Now let’s check the top rows of the data to get an idea of what we’re working with:

head(combined_data, 15)
##     X   realSum       room_type room_shared room_private person_capacity
## 1   0  194.0337    Private room       False         True               2
## 2   1  344.2458    Private room       False         True               4
## 3   2  264.1014    Private room       False         True               2
## 4   3  433.5294    Private room       False         True               4
## 5   4  485.5529    Private room       False         True               2
## 6   5  552.8086    Private room       False         True               3
## 7   6  215.1243    Private room       False         True               2
## 8   7 2771.3074 Entire home/apt       False        False               4
## 9   8 1001.8044 Entire home/apt       False        False               4
## 10  9  276.5215    Private room       False         True               2
## 11 10  909.4744 Entire home/apt       False        False               2
## 12 11  319.6401    Private room       False         True               2
## 13 12  675.6028 Entire home/apt       False        False               4
## 14 13  552.8086 Entire home/apt       False        False               2
## 15 14  209.0315    Private room       False         True               2
##    host_is_superhost multi biz cleanliness_rating guest_satisfaction_overall
## 1              False     1   0                 10                         93
## 2              False     0   0                  8                         85
## 3              False     0   1                  9                         87
## 4              False     0   1                  9                         90
## 5               True     0   0                 10                         98
## 6              False     0   0                  8                        100
## 7              False     0   0                 10                         94
## 8               True     0   0                 10                        100
## 9              False     0   0                  9                         96
## 10             False     1   0                 10                         88
## 11             False     0   0                 10                         96
## 12              True     1   0                 10                         97
## 13             False     0   0                  8                         87
## 14              True     0   0                 10                        100
## 15             False     1   0                  8                         96
##    bedrooms      dist metro_dist attr_index attr_index_norm rest_index
## 1         1 5.0229638  2.5393800   78.69038        4.166708   98.25390
## 2         1 0.4883893  0.2394039  631.17638       33.421209  837.28076
## 3         1 5.7483119  3.6516213   75.27588        3.985908   95.38695
## 4         2 0.3848620  0.4398761  493.27253       26.119108  875.03310
## 5         1 0.5447382  0.3186926  552.83032       29.272733  815.30574
## 6         2 2.1314201  1.9046682  174.78896        9.255191  225.20166
## 7         1 1.8810916  0.7297467  200.16765       10.599010  242.76552
## 8         3 1.6868070  1.4584036  208.80811       11.056528  272.31382
## 9         2 3.7191414  1.1961124  106.22646        5.624761  133.87620
## 10        1 3.1423614  0.9244044  206.25286       10.921226  238.29126
## 11        1 1.0099220  0.9171151  409.85812       21.702260  555.11428
## 12        1 2.1827071  1.5903814  191.50134       10.140123  229.29740
## 13        1 2.9330458  0.6280730  214.92334       11.380334  269.62490
## 14        1 1.3054939  1.3421624  325.25595       17.222519  390.91205
## 15        1 7.3045353  3.7208139   59.77618        3.165188   75.70106
##    rest_index_norm     lng      lat      city data_type
## 1         6.846473 4.90569 52.41772 amsterdam  weekdays
## 2        58.342928 4.90005 52.37432 amsterdam  weekdays
## 3         6.646700 4.97512 52.36103 amsterdam  weekdays
## 4        60.973565 4.89417 52.37663 amsterdam  weekdays
## 5        56.811677 4.90051 52.37508 amsterdam  weekdays
## 6        15.692376 4.87699 52.38966 amsterdam  weekdays
## 7        16.916251 4.91570 52.38296 amsterdam  weekdays
## 8        18.975219 4.88467 52.38749 amsterdam  weekdays
## 9         9.328686 4.86459 52.40175 amsterdam  weekdays
## 10       16.604478 4.87600 52.34700 amsterdam  weekdays
## 11       38.681161 4.87956 52.36953 amsterdam  weekdays
## 12       15.977773 4.92496 52.37107 amsterdam  weekdays
## 13       18.787851 4.88934 52.34697 amsterdam  weekdays
## 14       27.239314 4.87417 52.37509 amsterdam  weekdays
## 15        5.274959 4.99679 52.35645 amsterdam  weekdays
summary(combined_data)
##        X           realSum          room_type         room_shared       
##  Min.   :   0   Min.   :   34.78   Length:51707       Length:51707      
##  1st Qu.: 646   1st Qu.:  148.75   Class :character   Class :character  
##  Median :1334   Median :  211.34   Mode  :character   Mode  :character  
##  Mean   :1621   Mean   :  279.88                                        
##  3rd Qu.:2382   3rd Qu.:  319.69                                        
##  Max.   :5378   Max.   :18545.45                                        
##  room_private       person_capacity host_is_superhost      multi       
##  Length:51707       Min.   :2.000   Length:51707       Min.   :0.0000  
##  Class :character   1st Qu.:2.000   Class :character   1st Qu.:0.0000  
##  Mode  :character   Median :3.000   Mode  :character   Median :0.0000  
##                     Mean   :3.162                      Mean   :0.2914  
##                     3rd Qu.:4.000                      3rd Qu.:1.0000  
##                     Max.   :6.000                      Max.   :1.0000  
##       biz         cleanliness_rating guest_satisfaction_overall
##  Min.   :0.0000   Min.   : 2.000     Min.   : 20.00            
##  1st Qu.:0.0000   1st Qu.: 9.000     1st Qu.: 90.00            
##  Median :0.0000   Median :10.000     Median : 95.00            
##  Mean   :0.3502   Mean   : 9.391     Mean   : 92.63            
##  3rd Qu.:1.0000   3rd Qu.:10.000     3rd Qu.: 99.00            
##  Max.   :1.0000   Max.   :10.000     Max.   :100.00            
##     bedrooms           dist            metro_dist          attr_index     
##  Min.   : 0.000   Min.   : 0.01504   Min.   : 0.002301   Min.   :  15.15  
##  1st Qu.: 1.000   1st Qu.: 1.45314   1st Qu.: 0.248480   1st Qu.: 136.80  
##  Median : 1.000   Median : 2.61354   Median : 0.413269   Median : 234.33  
##  Mean   : 1.159   Mean   : 3.19129   Mean   : 0.681540   Mean   : 294.20  
##  3rd Qu.: 1.000   3rd Qu.: 4.26308   3rd Qu.: 0.737840   3rd Qu.: 385.76  
##  Max.   :10.000   Max.   :25.28456   Max.   :14.273577   Max.   :4513.56  
##  attr_index_norm      rest_index      rest_index_norm         lng         
##  Min.   :  0.9263   Min.   :  19.58   Min.   :  0.5928   Min.   :-9.2263  
##  1st Qu.:  6.3809   1st Qu.: 250.85   1st Qu.:  8.7515   1st Qu.:-0.0725  
##  Median : 11.4683   Median : 522.05   Median : 17.5422   Median : 4.8730  
##  Mean   : 13.4238   Mean   : 626.86   Mean   : 22.7862   Mean   : 7.4261  
##  3rd Qu.: 17.4151   3rd Qu.: 832.63   3rd Qu.: 32.9646   3rd Qu.:13.5188  
##  Max.   :100.0000   Max.   :6696.16   Max.   :100.0000   Max.   :23.7860  
##       lat            city            data_type        
##  Min.   :37.95   Length:51707       Length:51707      
##  1st Qu.:41.40   Class :character   Class :character  
##  Median :47.51   Mode  :character   Mode  :character  
##  Mean   :45.67                                        
##  3rd Qu.:51.47                                        
##  Max.   :52.64

Now let’s include the libraries we need to move forward:

library(ggplot2)
library(tidyr)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(caret)
## Loading required package: lattice
library(leaflet)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(sf)
## Linking to GEOS 3.11.2, GDAL 3.6.2, PROJ 9.2.0; sf_use_s2() is TRUE
library(readr)
library(corrplot)
## corrplot 0.92 loaded
library(RColorBrewer)
library(ggplotify)
library(grid)

Let’s clean up the data. I want to limit outliers by removing the rows in the highest and lowest 10% of listing prices, convert the boolean text variables to binary integers, and create integer dummy variables for each city and for whether the listing is for weekends or weekdays.

# Convert text variables to boolean integers
combined_data$room_shared <- ifelse(combined_data$room_shared == "False", 0, 1)
combined_data$room_private <- ifelse(combined_data$room_private == "False", 0, 1)
combined_data$host_is_superhost <- ifelse(combined_data$host_is_superhost == "False", 0, 1)

# Create dummy variables to represent data_type
combined_data$for_weekends <- as.integer(combined_data$data_type == "weekends")
combined_data$for_weekdays <- as.integer(combined_data$data_type == "weekdays")

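# Create dummy variable to represent full houses and apartments
# Note: as written, this flags every listing that is not a private room,
# so the small share of shared rooms is counted as well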
combined_data$full_home <- as.integer(combined_data$room_type != "Private room")

# Add a dummy variable for each city
encoded_cities <- model.matrix(~ 0 + city, data = combined_data)
colnames(encoded_cities) <- sub("city", "", colnames(encoded_cities))
combined_data <- cbind(combined_data, encoded_cities)

# Remove 10% of outliers in terms of listing price both from the bottom and the top
percentile_10 <- quantile(combined_data$realSum, 0.1)
percentile_90 <- quantile(combined_data$realSum, 0.9)
filtered_data <- combined_data %>%
  filter(realSum >= percentile_10, realSum <= percentile_90)

original_data <- combined_data
combined_data <- filtered_data
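To make the two main clean-up steps easier to follow, here is a minimal sketch on a made-up ten-row data frame (the toy values and object names are assumptions for illustration only). It builds one 0/1 column per city with model.matrix() and then trims rows outside the 10th and 90th price percentiles, the same way as above.

```r
# Toy listings (made-up values), including one extreme price
toy <- data.frame(
  city = c("paris", "rome", "paris", "vienna", "rome",
           "paris", "vienna", "rome", "paris", "vienna"),
  price = c(35, 90, 120, 150, 180, 210, 260, 320, 480, 18500)
)

# One 0/1 column per city; "~ 0 +" drops the intercept so no city is left out
dummies <- model.matrix(~ 0 + city, data = toy)
colnames(dummies) <- sub("^city", "", colnames(dummies))
toy <- cbind(toy, dummies)

# 10th and 90th percentiles of price
bounds <- quantile(toy$price, probs = c(0.1, 0.9))

# Keep only rows whose price lies between the two bounds (inclusive)
trimmed <- toy[toy$price >= bounds[1] & toy$price <= bounds[2], ]

nrow(trimmed)        # 8 of the 10 rows remain
mean(toy$price)      # 2034.5, dominated by the extreme listing
mean(trimmed$price)  # 226.25
```

Note that this kind of trimming removes a fixed share of rows from each tail whether or not they are truly anomalous, so the resulting model only describes mid-range listings.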

Now let’s summarize our transformed data:

summary(combined_data)
##        X           realSum       room_type          room_shared      
##  Min.   :   0   Min.   :113.2   Length:41369       Min.   :0.000000  
##  1st Qu.: 654   1st Qu.:160.4   Class :character   1st Qu.:0.000000  
##  Median :1365   Median :211.3   Mode  :character   Median :0.000000  
##  Mean   :1638   Mean   :234.8                      Mean   :0.005125  
##  3rd Qu.:2415   3rd Qu.:289.4                      3rd Qu.:0.000000  
##  Max.   :5378   Max.   :500.8                      Max.   :1.000000  
##   room_private    person_capacity host_is_superhost     multi       
##  Min.   :0.0000   Min.   :2.000   Min.   :0.0000    Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:0.0000    1st Qu.:0.0000  
##  Median :0.0000   Median :3.000   Median :0.0000    Median :0.0000  
##  Mean   :0.3684   Mean   :3.103   Mean   :0.2636    Mean   :0.2926  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:1.0000    3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :6.000   Max.   :1.0000    Max.   :1.0000  
##       biz         cleanliness_rating guest_satisfaction_overall    bedrooms    
##  Min.   :0.0000   Min.   : 2.000     Min.   : 20.00             Min.   :0.000  
##  1st Qu.:0.0000   1st Qu.: 9.000     1st Qu.: 90.00             1st Qu.:1.000  
##  Median :0.0000   Median :10.000     Median : 95.00             Median :1.000  
##  Mean   :0.3476   Mean   : 9.407     Mean   : 92.73             Mean   :1.112  
##  3rd Qu.:1.0000   3rd Qu.:10.000     3rd Qu.: 98.00             3rd Qu.:1.000  
##  Max.   :1.0000   Max.   :10.000     Max.   :100.00             Max.   :9.000  
##       dist            metro_dist          attr_index      attr_index_norm   
##  Min.   : 0.03466   Min.   : 0.002301   Min.   :  15.15   Min.   :  0.9263  
##  1st Qu.: 1.42561   1st Qu.: 0.249888   1st Qu.: 143.33   1st Qu.:  6.7956  
##  Median : 2.63400   Median : 0.414120   Median : 237.73   Median : 11.5968  
##  Mean   : 3.17955   Mean   : 0.676397   Mean   : 296.63   Mean   : 13.1757  
##  3rd Qu.: 4.28928   3rd Qu.: 0.734352   3rd Qu.: 382.28   3rd Qu.: 16.9250  
##  Max.   :25.28456   Max.   :14.273577   Max.   :4513.56   Max.   :100.0000  
##    rest_index      rest_index_norm         lng                lat       
##  Min.   :  19.58   Min.   :  0.6407   Min.   :-9.22634   Min.   :37.95  
##  1st Qu.: 271.43   1st Qu.:  8.9507   1st Qu.:-0.07504   1st Qu.:41.41  
##  Median : 534.31   Median : 18.3118   Median : 4.86326   Median :47.50  
##  Mean   : 641.54   Mean   : 23.1876   Mean   : 7.17219   Mean   :45.62  
##  3rd Qu.: 836.29   3rd Qu.: 33.5754   3rd Qu.:13.44577   3rd Qu.:51.45  
##  Max.   :6696.16   Max.   :100.0000   Max.   :23.78602   Max.   :52.64  
##      city            data_type          for_weekends     for_weekdays   
##  Length:41369       Length:41369       Min.   :0.0000   Min.   :0.0000  
##  Class :character   Class :character   1st Qu.:0.0000   1st Qu.:0.0000  
##  Mode  :character   Mode  :character   Median :1.0000   Median :0.0000  
##                                        Mean   :0.5068   Mean   :0.4932  
##                                        3rd Qu.:1.0000   3rd Qu.:1.0000  
##                                        Max.   :1.0000   Max.   :1.0000  
##    full_home        amsterdam           athens          barcelona      
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :1.0000   Median :0.00000   Median :0.00000   Median :0.00000  
##  Mean   :0.6316   Mean   :0.02826   Mean   :0.07994   Mean   :0.05777  
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
##      berlin           budapest           lisbon           london      
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.00000   Median :0.00000   Median :0.0000   Median :0.0000  
##  Mean   :0.05248   Mean   :0.07982   Mean   :0.1228   Mean   :0.1849  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.0000   Max.   :1.0000  
##      paris             rome            vienna       
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.0000   Median :0.00000  
##  Mean   :0.1274   Mean   :0.1882   Mean   :0.07842  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.00000

Now let’s proceed with some basic exploratory data analysis to get an idea of how the listing price varies:

# Price histogram
ggplot(combined_data, aes(x = realSum)) +
  geom_histogram(fill = "steelblue", bins = 30) +
  labs(x = "Price", y = "Frequency", title = "Distribution of Prices") +
  theme_minimal()

# Price by room type
ggplot(combined_data, aes(x = room_type, y = realSum)) +
  geom_boxplot(fill = "steelblue") +
  labs(x = "Room Type", y = "Price", title = "Price Variation by Room Type") +
  theme_minimal()

# Price vs. distance to metro
ggplot(combined_data, aes(x = metro_dist, y = realSum)) +
  geom_point(color = "steelblue") +
  labs(x = "Distance to Metro", y = "Price", title = "Price vs. Distance to Metro") +
  theme_minimal()

# Price by city
ggplot(combined_data, aes(x = city, y = realSum)) +
  geom_boxplot(fill = "steelblue") +
  labs(x = "City", y = "Price", title = "Distribution of Prices by City") +
  theme_minimal()

# Price by data_type (weekends and weekdays)
ggplot(combined_data, aes(x = data_type, y = realSum)) +
  geom_boxplot(fill = "steelblue") +
  labs(x = "Listing Period", y = "Price", title = "Distribution of Prices: Weekdays vs. Weekends") +
  theme_minimal()

We can see that the listing price isn’t normally distributed. Entire homes and apartments are priced higher than private rooms, which in turn are priced higher than shared rooms. Distance to the metro is somewhat negatively correlated with the listing price, and there is significant variation between cities. Listing prices for weekdays and weekends, however, aren’t very different.

Let’s take a look at the number of listings per city in our sample:

# Define the cities
cities <- c("amsterdam", "athens", "barcelona", "berlin", "budapest", "lisbon", "london", "paris", "rome", "vienna")

# Generate a color palette
n_colors <- length(cities)
color_palette <- brewer.pal(n_colors, "Set3")

# Create a named vector of colors
city_colors <- setNames(color_palette, cities)
# Number of listings per city, with the counts shown as labels on the slices
whole_dataset_pie <- original_data %>%
  group_by(city) %>%
  summarize(count = n()) %>%
  ggplot(aes(x = "", y = count, fill = city)) +
  geom_bar(stat = "identity", width = 1) +
  geom_text(aes(label = count), position = position_stack(vjust = 0.5), size = 2.5, color = "black") +  # Add labels to the bars
  coord_polar(theta = "y") +
  scale_fill_manual(values = city_colors) +
  labs(title = "Number of Listings per City (Whole Dataset)") +
  theme_void()  # Use theme_void to create a clear background

# Print the pie chart
print(whole_dataset_pie)

Let’s try looking at these on a map:

# Create an sf object
combined_sf <- st_as_sf(original_data, coords = c("lng", "lat"), crs = 4326)

# Create a base leaflet map
m <- leaflet() %>%
  addTiles(urlTemplate = "https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png") %>% # Use high-resolution tile source
  setView(lng = 10, lat = 50, zoom = 6) # Increased initial zoom level

# Add markers with different colors based on room_type
m <- m %>%
  addCircleMarkers(
    data = combined_sf,
    fillColor = ~case_when(
      room_type == "Entire home/apt" ~ "blue",
      room_type == "Private room" ~ "orange",
      room_type == "Shared room" ~ "red",
      TRUE ~ "blue" # Use a default color for other cases
    ),
    fillOpacity = 0.7, # Adjust opacity
    radius = 5, # Adjust marker size
    group = "Airbnb Listings", # Group for layer control
    popup = ~paste("City: ", city, "<br>Room Type: ", room_type) # Popup content
  )

# Add layer control for toggling layers on/off
m <- m %>%
  addLayersControl(overlayGroups = "Airbnb Listings", position = "topleft")

# Center the map and improve appearance
m <- m %>%
  setView(lng = 10, lat = 50, zoom = 6) %>%
  htmlwidgets::onRender("
    function(el, x) {
      // Inside onRender, `this` refers to the Leaflet map object
      var map = this;
      setTimeout(function() {
        map.invalidateSize();
      }, 100);
    }
  ")

# Display the map
m

Let’s proceed with a correlation matrix to get an idea of which explanatory variables correlate with the listing price, and of the correlations among the explanatory variables themselves.

# Correlation matrix
cor_matrix <- cor(combined_data[, c("realSum", "person_capacity", "cleanliness_rating", "guest_satisfaction_overall", "bedrooms", "dist", "metro_dist", "rest_index", "attr_index", "rest_index_norm", "attr_index_norm", "lng", "lat", "biz", "host_is_superhost", "room_shared", "room_private", "for_weekends", "for_weekdays", "full_home", "multi")])

# Plot correlation matrix
corrplot(cor_matrix, method = "color", type = "upper", order = "hclust", tl.cex = 0.7)

# Set narrow margins for later base-graphics plots
# (par() only affects plots drawn after it, so the corrplot above is unchanged)
par(mar = c(1, 1, 1, 1))
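Reading collinear pairs off a colored matrix by eye gets harder as the number of variables grows. As a rough sketch, the same information can be listed programmatically. The example below uses simulated data, and the 0.8 cutoff is an arbitrary assumption chosen for illustration, not a value used in this analysis.

```r
set.seed(1)
n <- 200
x1 <- rnorm(n)
toy <- data.frame(
  x1 = x1,
  x2 = 2 * x1 + rnorm(n, sd = 0.1),  # nearly a linear copy of x1
  x3 = rnorm(n)                       # unrelated to the other two
)

cm <- cor(toy)

# Blank out the diagonal and lower triangle so each pair appears only once
cm[lower.tri(cm, diag = TRUE)] <- NA

# Positions of the pairs whose absolute correlation exceeds the cutoff
cutoff <- 0.8  # arbitrary threshold, for illustration only
high <- which(abs(cm) > cutoff, arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r = cm[high]
)

print(high_pairs)
```

On the real data, the same steps applied to cor_matrix would list candidate pairs such as the raw and normalized index variables, from which one variable per pair can then be dropped.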

I chose to implement a random forest algorithm to see if I could find non-linear relationships between pricing and the explanatory variables, since a conventional linear regression did not work on our aggregate data. I first ran the RF algorithm using all the numerical explanatory variables, then used feature importance analysis to remove the variables that contributed little. Where the correlation matrix showed collinear variables, I kept the one with the higher feature importance score and removed the other.

# Independent variables
predictors <- c(
  "lat",
  "lng",
  "attr_index_norm",
  "dist",
  "bedrooms",
  "guest_satisfaction_overall",
  "barcelona", "london", "host_is_superhost", "multi", "biz", "amsterdam"
)

data_subset <- combined_data[, c(predictors, "realSum")]

# Split the data into a training set and a testing set
set.seed(123) # For reproducibility
sample_index <- sample(seq_len(nrow(data_subset)), floor(0.7 * nrow(data_subset)))
train_data <- data_subset[sample_index, ]
test_data <- data_subset[-sample_index, ]

# Train the Random Forest model
rf_model <- randomForest(realSum ~ ., data = train_data, ntree = 500)

# Make predictions on the test set
predictions <- predict(rf_model, test_data)

# Evaluate the model
rmse <- sqrt(mean((test_data$realSum - predictions)^2))
mae <- mean(abs(test_data$realSum - predictions))

# Print the evaluation metrics
cat("Root Mean Squared Error (RMSE):", rmse, "\n")
## Root Mean Squared Error (RMSE): 55.90214
cat("Mean Absolute Error (MAE):", mae, "\n")
## Mean Absolute Error (MAE): 40.32739

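The RMSE (about 55.9) and MAE (about 40.3) are acceptable relative to the mean (234.8) and median (211.3) listing price of the trimmed data, at roughly 24% and 17% of the mean respectively, so the model seems adequate.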

# Calculate R-squared
r_squared <- 1 - (sum((test_data$realSum - predictions)^2) / sum((test_data$realSum - mean(test_data$realSum))^2))
print(paste("R-squared (R²):", r_squared))
## [1] "R-squared (R²): 0.637456402396933"
# Calculate MAPE
mape <- mean(abs((test_data$realSum - predictions) / test_data$realSum)) * 100
print(paste("Mean Absolute Percentage Error (MAPE):", mape, "%"))
## [1] "Mean Absolute Percentage Error (MAPE): 18.1648715125016 %"

Our R² (about 0.64) and MAPE (about 18.2%) values are also acceptable, though not ideal: the model explains roughly 64% of the variance in test-set prices.
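To make the four metrics concrete, here is a small worked example on made-up observed and predicted prices (four values chosen so the arithmetic is easy to check by hand). The formulas are the same as the ones applied to the test set above.

```r
# Made-up observed and predicted prices
observed  <- c(100, 200, 300, 400)
predicted <- c(110, 190, 330, 380)

errors <- observed - predicted  # -10, 10, -30, 20

# RMSE: square root of the mean squared error, sqrt(1500 / 4)
toy_rmse <- sqrt(mean(errors^2))

# MAE: mean of the absolute errors, (10 + 10 + 30 + 20) / 4 = 17.5
toy_mae <- mean(abs(errors))

# R-squared: 1 - SSE / SST = 1 - 1500 / 50000 = 0.97
toy_r2 <- 1 - sum(errors^2) / sum((observed - mean(observed))^2)

# MAPE: mean absolute error relative to the observed value, in percent = 7.5
toy_mape <- mean(abs(errors / observed)) * 100

cat("RMSE:", toy_rmse, " MAE:", toy_mae, " R2:", toy_r2, " MAPE:", toy_mape, "\n")
```

RMSE penalizes large misses more heavily than MAE, which is why it comes out higher here (about 19.4 versus 17.5), just as it does for the random forest.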

# Compare observed values and predicted values
plot(test_data$realSum, predictions, 
     xlab = "Observed Price",
     ylab = "Predicted Price",
     main = "Comparison of Observed and Predicted Prices",
     col = "blue",
     pch = 16)
     
# Add a diagonal reference line
abline(0, 1, col = "red")

Let’s take a look at the residual plot to check for any pattern in the error terms:

# Calculate residuals
residuals <- test_data$realSum - predictions

# Plot residuals against predicted values
plot(predictions, residuals,
     xlab = "Predicted Price",
     ylab = "Residuals",
     main = "Residual Plot",
     col = "blue",
     pch = 16)

# Add a horizontal reference line at y = 0
abline(h = 0, col = "red")

Let’s look into feature importance to see if we have any redundant explanatory variables:

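# Compute the feature importance scores (randomForest is already loaded above;
# the result is named importance_scores so it does not mask importance())
importance_scores <- importance(rf_model)
# Create the feature importance plot
varImpPlot(rf_model, pch = 19, col = "blue", bg = "white", main = "Feature Importance")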

In conclusion, listing prices in our sample vary considerably between cities. Our model reflects this through the strong relationship between price and longitude and latitude, which incidentally were more strongly related to price than the city dummy variables. The price is also correlated with the indexes relating to distance to restaurants and attractions, but since these are correlated with each other, we used only one of them in the random forest model to prevent collinearity. The same reasoning applies to the number of bedrooms and the distance to the city center.

Let’s perform K-fold cross-validation:

# Create a data frame with only the selected predictors and the target variable (price)
data_subset <- combined_data[, c(predictors, "realSum")]

# Define the number of folds (K) for cross-validation
num_folds <- 5  # You can adjust this as needed

# Create a training control object for cross-validation
train_control <- trainControl(
  method = "cv",          # Use K-fold cross-validation
  number = num_folds,     # Number of folds
  verboseIter = TRUE      # Display progress
)

# Run K-fold cross-validation (train() fits new random forests on each fold; it does not reuse rf_model)
set.seed(123)  # For reproducibility
cv_results <- train(
  realSum ~ .,             # Formula for the target variable
  data = data_subset,     # Data frame
  method = "rf",          # Random forest method
  trControl = train_control,  # Training control
  tuneGrid = data.frame(mtry = 3)  # Adjust mtry as needed
)
## + Fold1: mtry=3 
## - Fold1: mtry=3 
## + Fold2: mtry=3 
## - Fold2: mtry=3 
## + Fold3: mtry=3 
## - Fold3: mtry=3 
## + Fold4: mtry=3 
## - Fold4: mtry=3 
## + Fold5: mtry=3 
## - Fold5: mtry=3 
## Aggregating results
## Fitting final model on full training set
# Print the cross-validation results, including metrics
print(cv_results)
## Random Forest 
## 
## 41369 samples
##    12 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 33096, 33096, 33095, 33093, 33096 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   56.81461  0.6472051  42.45537
## 
## Tuning parameter 'mtry' was held constant at a value of 3

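As we can see, the cross-validation results (RMSE ≈ 56.8, R² ≈ 0.65, MAE ≈ 42.5) are close to those from the initial hold-out test (RMSE ≈ 55.9, R² ≈ 0.64, MAE ≈ 40.3), which reinforces the initial evaluation of the random forest model.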
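For readers curious about what K-fold cross-validation does under the hood, here is a minimal hand-rolled sketch in base R. It is an illustration under stated assumptions, not what caret runs internally: the data is simulated, and a simple linear model stands in for the random forest so the example runs quickly without extra packages.

```r
set.seed(42)
n <- 100
toy <- data.frame(x = runif(n, 0, 10))
toy$y <- 3 + 2 * toy$x + rnorm(n)  # linear signal plus noise with sd = 1

k <- 5
# Randomly assign each row to one of k folds of equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold: train on the other k - 1 folds, evaluate on the held-out fold
fold_rmse <- sapply(seq_len(k), function(i) {
  train_part <- toy[fold_id != i, ]
  test_part <- toy[fold_id == i, ]
  fit <- lm(y ~ x, data = train_part)
  pred <- predict(fit, newdata = test_part)
  sqrt(mean((test_part$y - pred)^2))
})

# The cross-validated RMSE is the average over the k held-out folds
cv_rmse <- mean(fold_rmse)
print(round(fold_rmse, 3))
print(round(cv_rmse, 3))
```

Because every row is used for testing exactly once, the averaged score depends less on one lucky or unlucky split than a single 70/30 hold-out does.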